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Abstract 

We describe a fast method to eliminate features (variables) in ii-penalized least-square regression 
(or LASSO) problems. The elimination of features leads to a potentially substantial reduction in 
running time, especially for large values of the penalty parameter. Our method is not heuristic: 
it only eliminates features that are guaranteed to be absent after solving the LASSO problem. 
The feature elimination step is easy to parallelize and can test each feature for elimination inde- 
pendently. Moreover, the computational effort of our method is negligible compared to that of 
solving the LASSO problem - roughly it is the same as single gradient step. Our method extends 
the scope of existing LASSO algorithms to treat larger data sets, previously out of their reach. 
We show how our method can be extended to general l\ -penalized convex problems and present 
preliminary results for the Sparse Support Vector Machine and Logistic Regression problems. 
Keywords: Sparse Regression, LASSO, Feature Elimination, SVM, Logistic Regression 

1. Introduction 

"Sparse" classification or regress ion problems, wh ich involve an l\ — n orm regularization has attracted 
a lot of interest in the statistics ( Tibshiraml 19961 ) . signal processing ( Chen et al. . 2001 ). and machine 
learning communities. The l\ regularization leads to sparse solutions, which is a desirable property 
to achieve model selection, or data compression. For instance, consider the prob lem of £i-regularized 
least square regression commonly referred to as the LASSO ( Tibshirani . 19961 ). In this context, we 
are given a set of m observations a 2 ; £ M. n , i = 1, . . . , m and a response vector y £ M. m . Denoting 
by X = (oi, . . . , a m ) T £ R mxn the feature matrix of observations, the LASSO problem is given by 



V(X) : 0(A) 



1 2 

min- \\Xw-y\\ 2 + A | 
w 2 



(1) 



where A is a regularization parameter and w £ M. n is the optimization variable. For large enough 
values of A, any solution w* £ W l of (JT|) is typically sparse, i.e. w* has few entries that are non-zero, 
and therefore identifies the features in X (columns of X) that are useful to predict y. 

Sev eral efficient algo ri thms have been developed for the LASSO problem, including Efron et al 



( 2004l):lKim et all (l2007l);|Park and Hastiej j2007l) : lDonoho "and Tsaid |2008h : iFriedman et al.l(|2007f ) 



Becker et al. (2010); Friedman et al] (201C ) and references therein. However, the complexity of these 
algorithms, when it is known, grows fast with the number of variables. While the LASSO problem is 
particularly appealing in presence of very high-dimensional problems, the available algorithms can 
be quite slow in such contexts. In some applications, the feature matrix is so big that it can not 
even be loaded and LASSO solvers cannot be used at all. Hence it is of paramount interest to be 
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able to efficiently eliminate features in a pre-processing step, in order to reduce dimensionality and 
solve the optimization problem on a reduced matrix. 

Assume that a sparse solution exists to ([T]) and that we were able to identify e zeros of w* a 
priori to solving the LASSO problem. Identifying e zeros in w* a priori to solving (Q]) is equivalent 
to removing e features (columns) from the feature matrix X. If e is large, we can obtain w* by 
solving JTJ) with a "small" feature matrix X. 

In this paper we propose a "safe" feature elimination (SAFE) method that can identify zeros in 
the solution w* a priori to solving the LASSO problem. Once the zeros are identified we can safely 
remove the corresponding features and then solve the LASSO problem ([1]) on the reduced feature 
matrix. 

Feature selection methods are often used to accomplish dimensio nality reduct i on, an d are of 
utmost relevance for data sets of massive dimension, see for example Fan and Lv ( 2010h . These 



methods, w hen used as a pre-proces sing step, have been referred to in the literature as screening 
procedures ( Fan and Lv . 2010l 20081 ). They typically rely on univariate models to score features, 



independently of each other, and are usually computationally fast. Clas sical procedures are based on 
correlation coefficients, two-sample ^-statistics or chi-square statistics ( Fan and Lv . 2010h : see also 



Formanl (|2003f ) and the references therein for an overview in the specific case of text classification. 
Most screening methods might remove features that could otherwise have been selected by the 
regression or classification algorit hm. However, som e of them were recently shown to enjoy the so- 



called "sure screening" property (jFan and Lvl 120081) : under some technical conditions, no relevant 



feature is removed, with probability tending to one. 

Screening procedures typically ignore the specific classification task to be solved after feature 
elimination. In this paper, we propose to remove features based on the supervised learning problem 
considered, that is on both the structure of the loss function and the problem data. While we focus 
mainly on the LASSO problem here, we provide results for a large class of convex classification or 
regression problems. The features are eliminated according to a sufficient, in general conservative, 
condition, which we call SAFE (for SAfe Feature Elimination). With SAFE, we never remove 
features unless they are guaranteed to be absent if one were to solve the full-fledged classification or 
regression problem. 

An interesting fact is that SAFE becomes extremely aggressive at removing features for large 
values of the penalty parameter A. The specific application we have in mind involves large data sets 
of text documents, and sparse matrices based on occurrence, or other score, of words or terms in 
these documents. We seek extremely sparse optimal coefficient vectors, even if that means operating 
at values of the penalty parameter that are substantially larger than those dictated by a pure concern 
for predictive accuracy. The fact that we need to operate at high values of this parameter opens the 
hope that, at least for the application considered, the number of features eliminated by using our fast 
test is high enough to allow a dramatic reduction in computing time and memory requirements. Our 
experimental results indicate that for many of these data sets, we do observe a dramatic reduction 
in the number of variables, typically by an order of magnitude or more. The method has two main 
advantages: for medium- to large-sized problem, it enables to reduce the computational time. More 
importantly, SAFE allows to tackle problems that are too huge to be even loaded in memory, thereby 
expanding the reach of current algorithms 

The paper is organized as follows. In section [21 we derive the SAFE method for the LASSO 
problem. In section [31 we illustrate the use of SAFE and detail some relevant algorithms. In section 
21 we extend the results of SAFE to general convex problems and derive preliminary SAFE results for 
the Sparse Support Vector Machine and Logistic regression problems. In section [5l we experiment 
the SAFE for LASSO method on synthetic data and on data derived from text classification sources. 
Numerical results demonstrate that SAFE provides a substantial reduction in problem size, and, as 
a result, it enables the LASSO algorithms to run faster and solve huge problems originally out of 
their reach. 
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Notation. We use 1 and to denote a vector of ones and zeros, with size inferred from context, 
respectively. For a scalar a, a_|_ denotes the positive part of a. For a vector a, this operation is 
component-wise, so that l T a+ is the sum of the positive elements in a. We take the convention that 
a sum over an empty index sets, such as a i with k < 0, is zero. 



2. The SAFE method for the LASSO 

The SAFE method crucially relies on duality and optimality conditions. We begin by reviewing the 
appropriate facts. 



2.1 Dual problem and optimality conditions for the LASSO 



A dual to the LASSO problem (fTJ) (|Kim et all 120071 ) can be written as 



V(X) : (f>{\) := max G{8) : \0 T x k \ < A, k = 1, . 



(2) 



with Xk € M m , k = 1, . . . , n, the fc-th column of X and G(8) 



\y\\l 



9 + y\\ 2 . In this context, 



we call V(X) the primal problem, w the primal variable, and w* a primal optimal point. The dual 
problem T>(X) is a convex optimization problem with dual variable 8 <G K m . We call 8 dual feasible 
when it satisfies the constraints in T>(X). Figure 1(a) shows the geometry of the feasibility set in the 
dual space. The quantity G{8) gives a lower bound on the optimal value 4>(X) for any dual feasible 



point 8, i.e. G{8) < 0(A), \8 T x k \ < A, k = 1, . . 



,n. 



For the LASSO problem JTJ) strong duality 



holds and the optimal value of T>(X) achieves 4>(X) at 8* the solution of ([2]) or the dual optimal point. 
Furthermore, the following relation holds at optimum: 8* = Im* — y. 

We consider the dual problem T>(X) because of an important property that helps us derive our 
SAFE method. Assuming w* is s parse, knowledge of 8* a l lows u s to identify the zeros in w* by 
checking the optimality condition ( Bovd and Vandenberghei 2004J ): 



x k \ < X => (w*) k = 0. 



(3) 



Figure 1(b) illustrates the geometric interpretation of the inequality test 0* T Xfc < A in ((3]). 



2.2 Basic idea 

The basic idea behind SAFE is to use the optimality condition ([3]) with 8* in the inequality test 
replaced by a set O that contains the dual optimal point, i.e. |0 T a;/ £ | < A, V6> <G © and 8* £ O. If 
the inequality test holds for the whole set O, then the fc-th entry of w* is zero, (w*)k = 0. 

In the following sections, we show how to construct the set O using optimality conditions of the 
dual problem, and derive the corresponding SAFE test. 

In our derivation, we assume that we have knowledge of a solution Wq of "P(Ao) for some Ao, and 
we seek to apply SAFE for "P(A) with A < Ao- By default, we can choose Ao to be large enough for 
Wq to be identically zero. To find such a Ao, we substitute Wq = in ((TJ to obtain 4>{Xq) = \ ||y||2- 
By strong duality, £>(Ao) achieves a value of </>(Ao) = \ ||y||2 at the unique solution 8^ = —y. The 
point 8q is a dual feasible point and satisfies the constraints Ao > |(— y) T Xk \ , k = 1, . . . ,n. Note 
that Ao is not uniquely defined but we choose the smallest value above which Wq = 0, that is 
A = maxK^n ly 7 ^! = \\X T y\\, x . 



2.3 Constructing O 

We start by finding a set 9 that contains the dual optimal point 8* of T>(X). We express O as the 
intersection of two sets Oi and O2, where each set corresponds to different optimality conditions. 
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(a) (b) 

Figure 1: Geometry of the dual problem T>(\). (a) Feasibility set of the dual problem. The grey 
shaded polytope shows the feasibility set of T>(\). The feasibility set is the intersection 
of n slabs in the dual space corresponding to the n features Xk, k = 1, . . . , n. The level 
set G{9) = 71, where 71 = G(6*), corresponds to the optimal value of the dual function 
and is tangent to the feasibility set at the dual optimal point 6*. (b) Geometry of the 
inequality test in ([3]). The grey shaded region is the slab corresponding to feature Xk, i.e. 
{0 I |0 T £fc| < A}. The test |0* T a;fc| < A is a strict inequality when the point 8* is in the 
interior of the slab defined by the feature Xk- Thus if the dual optimal point is inside a 
slab defined by feature Xk , by optimality condition ([3]) the fc-th entry of the primal optimal 
solution w* is zero, i.e. (w*)k = 0. 
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We construct Gi using the optimality condition of T>(\): 9* is a dual optimal point if G(9*) > 
G(9) for all dual feasible points 9. Let 9 S be a dual feasible point to T>(X), and 7 := G(9 S ). Obviously 
G{0*) > 7 and the set 9i := {9 | G{9) > 7} contains 6*, i.e. 6* G X . 

One way to obtain a lower bound 7 is by dual scaling. We set 9 S to be a scaled feasible dual 
point in terms of 9$, 9 S := s9$ with seR constrained so that 9 S is a dual feasible point for T>(\), 
that is, ||X T 6' S || 00 < A or |s| < A/Ao- We then set 7 according to the convex optimization problem: 

7 = max \g(s9q) : \s\ < I = max (/3 s - ^s 2 a : |s| < "r" 
s L A J s |^ l A 

with a := 6fei > 0, P := |?/ T (95|. We obtain 

2a \ V ftAo 

We construct O2 by applying a first order optimality condition on T>(Xq): 9^ is a dual optimal 
point if g T (9o — 9q) < for every dual point 9q that is feasible for 2?(Ao), where g := VG(6>q) = 9^+y. 
For A < Ao, any dual point 9 feasible for T>(\) is also dual feasible for 2?(Ao) (|6* T :z;fc| < A < Aq k = 
l,...,n). Since 0* is dual feasible for £>(A ), we conclude 6* G 2 := {9 \ g T {9 - flg j < |. 



Figure 2(a) shows the geometry of Oi, O2 and O in the dual space; Figure 2(b) shows the 



geometric interpretation of the inequality test when it is applied to the set O. 
2.4 SAFE-LASSO theorem 

Our criterion to identify the A;-th zero in w* and thus remove the fc-th feature (column) from the 
feature matrix X in problem V(X) becomes 

A > \9 T x k \ = m&x(9 T x k , -9 T x k ) : 9 G 6. (5) 

An equivalent formulation of condition ([5]) is 

A > max(P('y,Xk),P{'j,-x k )), 

where P(j,x k ) is the optimal value of a convex optimization problem with constraints 9 G Oi and 
9 G 6 2 : 

P( 7 , x k ) := max x T k 9 : G{9) > 7, g T {9 - 9*) > 0. (6) 


It turns out that the above problem is simple enough to admit a closed-form solution (see Ap- 
pendix The resulting test can be summarized as follows. 

Theorem (SAFE-LASSO) Consider the LASSO problem V{\) in {7p. Let A > A be a value 
for which an optimal solution Wq G K' 1 is known. Denote by x k the k-th feature (column) of the 
matrix X . Define 

£ = {k\\> max(P(7, x k ), P( 7 , -a*)} , (?) 



where 



p( )= ,et T x k + V k D( 7 ) \\9\\i\\x k \\ 2 > D(j)xlg, 

S -V T x k + \\x k \\ 2 D(>y) \\g\\ 2 2 \\x k \\ 2 <D( 7 )xlg, 



with 



9* =Xw* -y, g:=6* +y, a := 9f9* , [3 := \y T 6* Q \, 7 := ^ U - (l - 

Dh) = (\\y\\l 2 7 ) 1/2 , D( 7 ) = (£(7) 2 - llgll^) 172 , * fe := [\\x k f 2 - {j ^Pj 1/2 
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y:=G(s8;) 




(a) (b) 



Figure 2: (a) Sets containing 9* in the dual space. The set Ox := {9 \ G(9) > 7} shown in red 
corresponds to a ball in the dual space with center —y. The set 02 := {0 \ g T (9 — 9q) < 0} 
with g := VG(#q) shown in yellow corresponds to a half space with supporting hyperplane 
passing through 9q and normal to VG(#q). The set = Oi n 62 shown in orange contains 
the dual optimal point 9*. (b) Geometry of the inequality test |# T a;fc| < A, \/9 £ 0. The 
grey shaded region is the slab corresponding to feature Xk, i.e. {9 | 9 T Xk < A}. The test 
|# T a:fc| < A, W6 £ is a strict inequality when the entire set (shown in orange) is inside 
the slab defined by the feature Xk- In such case, the dual optimal point 9* <E is also 
inside the slab and by ([3]) we conclude {w*)k = 0. 
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Then, for every index e £ £ , the e-th entry of w* is zero, i.e. (to*) = 0, and feature x e can be safely 
eliminated from X a priori to solving the LASSO problem ([!]]. ■ 
When we don't have access to a solution Wq of P(Ao), we can set Wq = and Ao = A max := 
||^ T y||oo- In this case, the inequality test A > max(P(7, Xk), P("y, — %k) m the SAFE-LASSO theo- 
rem takes the form A > PfcA max , with 

II2/II2 IMI2 + \y Tx k\ 

Pk ~ 



Il2/Il2ll x fc|| 2 + A max ' 

In the case of scaled data sets, for which \\y\\ 2 = 1 and ||;Efc|| 2 = 1 for every k, has a convenient 
geometrical interpretation: 

1 + |cosafe| 



Pk 



1 + max I cosa,- 

i<j<" 



where au is the angle between the fc-th feature and the response vector y. Our test then consists in 
eliminating features based on how closely they are aligned with the response, relative to the most 
closely aligned fe ature. For scaled data sets, our test is very similar to standard correlation-based 
feature selection ( Fan and Lv , l2008h : in fact, for scaled data sets, the ranking of features it produces 



is then exactly the same. The big difference here is that our test is not heuristic, as it only eliminates 
features that are guaranteed to be absent when solving the full-fledged sparse supervised learning 
problem. 

2.5 SAFE for LASSO with intercept problem 

The SAFE-LASSO theorem can be applied to the LASSO with intercept problem 
Pint (A) : <MA) := min - \\Xuu + v - y\\ 2 2 + A , 

w,v 2 

with v £ M. m the intercept term, by using a simple transformation. Taking the derivative of the 
objective function of Pint (A) w.r.t v and setting it to zero, we obtain v = y — X T w with y = 
(l/m)l T y, X = (l/m)Xl and 1 € R m the vector of ones . Using the expression of v, Pint (A) can 
be expressed as 

1 2 
Pnt(A) : 0(A) := min - LX" cent to - y cen t \\ 2 + A | to x , 

w Z 

with X cent := X — X1 T and y ccn t = V — yl. Thus the SAFE-LASSO theorem can be applied to P; n t 
and eliminate features (columns) from A ccnt . 

2.6 SAFE for elastic net 

The elastic net problem 



1 2 1 2 

Pciastic(A) : 4>(X) := min- \\Xw - y\\ 2 + A ||to \\ l + -e \\w" 
w 2 2 



2 • 

T 



can be expressed in the form of P(A) by replacing X and y of (TTJ) with A c i ast i c = (X T ,y/el) 

and i/ciastic = (j/ T jO T ) T . This transformation allows us to apply the SAFE-LASSO theorem on 
Peiastic(A) and eliminate features from A e i ast i c . 

3. Using SAFE 

In this section we illustrate the use of SAFE and detail the relevant algorithms. 
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3.1 SAFE for reducing memory limit problems 

SAFE can extend the reach of LASSO solvers to larger size problems than what they could originally 
handle. In this section, we are interested in solving for w* d the solution of V(Xd) under a memory 
constraint of loading only M features. We can compute w* d by solving a sequence of problems, 
where each problem has a number of features less than our memory limit M. We start by finding 
an appropriate A where our SAFE method can eliminate at least n — M features, we then solve a 
reduced size problem with Lp < M features, where Lp = \£ c \ is the number of features left after 
SAFE and £ c = {l,...,n}\£ is the complement of the set £ in the SAFE-LASSO theorem. We 
proceed to the next stage as outlined in algorithm Q] 



Algorithm 1 SAFE for reducing memory limit problems 

given a feature matrix X £ R mx ", response y £ M. m , penalty parameter Ad , memory limit M and 
LASSO solver: LASSO, i.e. w* = LASSOpT, y, A), 
initialize A = \\X T y\\ oc , mi *=0£ R", 
repeat 

1. Use SAFE to search for a A with LF < M . Obtain A and £. % Lp is the number of features 
left after SAFE and £ is the set defined in the SAFE-LASSO theorem. 

2. if A < Xd then A = Ad, apply SAFE to obtain £ end if. 

3. Compute the solution w*. w*(£ c ) = LASS0(A(£ c , :), y, A), w*(£) = 0; % w*(£ c ) and 
X(£ c , :) are the elements and columns of w* and X defined by the set £ c , respectively. 
£ c = {1, . . . , n} \£ is the complement of the set £ . 

4. A := A, w„ = w*. 
until Ao = Ad 



We use a bisection method to find an appropriate value of A for which SAFE leaves Lp £ 
[AI — ep 1 M] features, where ep is a number of feature tolerance. The bisection method on A is 
outlined in algorithm [2] 

Algorithm 2 Bisection method on A. 

given a feature matrix X £ R mx ™, response y £ M. m , penalty parameter Ao with LASSO solution 
Wq, tolerance cf > and memory limit M. 
initialize I = 0, and u = Xq. 
repeat 

1. Set A := (l + u) /2. 

2. Use the SAFE-LASSO theorem to obtain £. 

3. Set Lp = \£ c \. 

4. if Lp > M then set I := X else set u := X end if 
until M — Lp < ep and Lp < M. 



3.2 SAFE for LASSO run-time reduction 

In some applications like Gawalt et al. ( 2010h , it is of interest to solve a sequence of problems 
■p(Ai), . . .'P(Xs) for decreasing values of the penalty parameters, i.e. Ai > ... > A s . The compu- 
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tational complexities of LASSO solvers depend on the number of features and using SAFE might 
result in run-time improvements. For each problem in the sequence, we can use SAFE to reduce the 
number of features a priori to using our LASSO solver as shown in algorithm [3J 

Algorithm 3 Recursive SAFE for the Lasso 

given a feature matrix X <E W nxn , response y <E M. m , a sequence of penalty parameters A s < . . . < 
Ai < \\X T y\\ oa , and LASSO solver: LASSO, 
initialize A = \\X T y\\ QO , «i *=Oe K". 
for i = l until i = s do 

1. Set Aq = Aj_i, and A = Aj. 

2. Use the SAFE-LASSO theorem to obtain £. 

3. Compute the solution w*. w*(£ c ) = LASS0(A(£ c , :),y, A), w*(£) = 0. % w*(£ c ) and 
X(£ c , :) are the elements and columns of w* and X defined by the set £ c , respectively. 
£ c = {1, . . . , n} \£ is the complement of the set £ . 

4. Set Wq = w*. 
end for 



4. SAFE applied to general ^-regularized convex problems 

The SAFE-LASSO result presented in section l2~4l for the LASSO problem ([T]) can be adapted to a 
more general class of l\— regularized convex problems. We consider the family of problems 

m 

V(X) : 0(A) :=mmV /(<zf w + M + c;) + A Hli , (9) 

W, V J 

i=l 

where / is a closed convex function, and non- negative everywhere, ai £ R™, i = 1, . . . , m, b, c € R m 
are given. The LASSO problem (P) is a special case of © with /(C) = (1/2)C 2 , a* € K", i = 1, . . . ,m 
the observations, c = — y is the (negative) response vector, and 6 = 0. Hereafter, we refer to the 
LASSO problem as 'Plasso(A) and to the general class of ^-regularized problems as V(\). In this 
section, we outline the steps necessary to derive a SAFE method for the general problem "P(A). 
We show some preliminary results for deriving SAFE methods when /(£) is the hing loss function, 
/hi(C) = (1 ~ + 7 an d the logistic loss function /i og (£) = log(l + 

4.1 Dual Problem 

The first step is to devise the dual of problem © , which is 

V(X) : (j>(\) = max G{9) : 6 T b = 0, |6> T a; fe | < A, k=l,...,n, (10) 

6 

where 

m 

G{6):=c T 6~Y.f*^) (H) 

with /*($) = max^ — f(£) the conjugate of the loss function /(C), and Xk the fc-th column or 
feature of the feature matrix X = (oi, . . . , a m 

y e K mxn^ G (Qj ig the dual f unctioilj which iSj by 
construction, concave. We assume that strong duality holds and primal and dual optimal p o ints a re 



attained. Due to the optimality conditions for the problem (see iBovd and Vandenberghei ( 200 
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constraints for which |# T Xfc| < A at optimum correspond to a zero element in the primal variable: 
(w*) k = 0, i.e. 

\9* T x k \ < A => (w*) k = 0. (12) 

4.2 Optimality set 6 

For simplicity, we consider only the set := {9 | G(9) > 7} which contains 9* the dual optimal 
point of Z?(A). One way to get a lower bound 7 is to find a dual point 9 S that is feasible for the dual 
problem P(A), and then set 7 = G(9 S ). 

To obtain a dual feasible point, we can solve the problem for a higher value Ao > A of the 
penalty parameter. (In the specific case examined below, we will see how to set Ao so that the 
vector Wq = at optimum.) This provides a dual point 9q that is feasible for X>(Ao), which satisfies 
Ao = 11^6*0 1| 00 ■ In turn, 9q can be scaled so as to become feasible for T>(\). Precisely, we set 9 S = s9q, 
with ll-X^slloo < A equivalent to |s| < X/Xq. In order to find the best possible scaling factor s, we 
solve the one-dimensional, convex problem 

7(A) := max G(s9 ) : |s| < A. (13) 
s Ao 

Under mild conditions on the loss function /, the above problem can be solved by bisection in O(m) 
time. By construction, 7(A) is a lower bound on 0(A). We can generate an initial point 9q by solving 
P(A ) with w = 0. We get 

m m 

min y f(hvo + a) = min max 6T{bvo + c) — y /* ((0o),) = max G(9q). 

vo * — • v do t — ' On : b T dn=0 

Solving the one-dimensional problem above can be often done in closed-form, or by bisection, in 
O(m). Choosing 9q to be any optimal for the corresponding dual problem (the one on the right- 
hand side) generates a point that is dual feasible for it, that is, G(6q) is finite, and b T 9o = 0. 

The point 9q satisfies all the constraints of problem T>(X), except perhaps for the constraint 
||-X"#||oo < A, i.e. ||X#ol|oo > A. Hence, if A > A := ||X#o||oo, then 9q is dual optimal for X>(A) and 
by the optimality condition (fT2")) we have w* = . Note that, since 8q may not be uniquely defined, 
Aq may not necessarily be the smallest value for which w* — is optimal for the primal problem. 



4.3 SAFE method 

Assume that a lower bound 7 on the optimal value of the learning problem </>(A) is known: 7 < 0(A). 
(Without loss of generality, we can assume that < 7 < YZLi f(ci))- The test 

A > max(P(7, x k ), P(j, -x k )), 

allows to eliminate the fc-th feature from the feature matrix X, where P(7, x^) is the optimal value 
of a convex optimization problem with two constraints: 

P(7,x fe ) := max 9 T x k : G(8) > 7, 9 T b = 0. (14) 
9 

Since P(7, x k ) decreases when 7 increases, the closer 4>(X) is to its lower bound 7, the more aggressive 
(accurate) our test is. 

By construction, the dual function G is decomposable as a sum of functions of one variable only. 
This particular structure allows to solve problem (|14p very efficiently, using for example interior- 
point methods, for a large class of loss functions /. Alternatively, we can express the problem in 
dual form as a convex optimization problem with two scalar variables: 



\ - , / {xk)i + MCi + vb % . 
P{l,Xk)= mm -7 / u + M> / 1 ■ (15) 
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Note that the expression abov e involves the perspective of the function /, which is convex (see 



Bovd and Vandenberghd (|2004[) ). For many loss functions /, the above problem can be efficiently 
solved using a variety of methods for convex optimization, in (close to) 0(m) time. We can also 
set the variable v = 0, leading to a simple bisection problem over fi. This amounts to ignore the 
constraint 9 T b = in the definition of P(7, x), resulting in a more conservative test. More generally, 
any pair (/x, v) with fi > generates an upper bound on P(j,x), which in turn corresponds to a 
valid, perhaps conservative, test. 

4.4 SAFE for Sparse Support Vector Machine 

We turn to the sparse support vector machine classification problem: 

m 

V hi (X) ■ 0(A) :=min VCl-i/i^w + «))+ + A|H|i, (16) 

W,V L ' 

i=l 

where z\ £ M™, i = 1, . . . , m are the data points, and y E {—1,1}™ is the label vector. The above is 
a special case of the generic problem ©, where /(C) := (1 — £)+ i s ^he hinge loss, b = y, c = 0, and 
the feature matrix X is given by X = \y\Zi, . . . , y m z m ] T , so that Xk = [y\Zi(k), . . . , y m z m (k)] T ■ 

We denote by Z+,2L the set of indicies corresponding to the positive and negative classes, 
respectively, and denote by m± = \T±\ the associated cardinalities. We define m := min(m + ,m_). 
Finally, for a generic data vector x, we set x^ = (xi)igx ± <G R" 1 *, k = 1, ...,n, the vectors 
corresponding to each one of the classes. 

The dual problem takes the form 

V hl {\) : <t>{\) := max G M (0) : -1 < 9 < 0, 6 T y = 0, \B T x k \ < A, fe = l,...,n. (17) 
with G hi ((9) = 1 T 0. 
AAA Test, 7 given 

Let 7 be a lower bound on 4>{X). The optimal value obtained upon setting w = in (fl6| is given by 



mm 

V 

i=l 



y^(l - yjv)+ = 2 min(m + , m_) := 7 max . (18) 



Hence, without loss of generality, we may assume < 7 < 7,, 
The feature elimination test hinges on the quantity 



Phi(l,x) = max 9 T x : 1 T 6 > 7, 9 T y^0 1 1 < 9 < 

= mm -7/z + /1 > ^ 

*.><),«/ \ M / (19) 

= min -7/x + V](/Li + -a:,) + . 

i=l 

In appendix lC.il we show that for any a;, the quantity P(7, a;) is finite if and only if < 7 < 7 ma x> 
and can be computed in 0(m\ogm), or less with sparse data, via a closed-form expression. That 
expression is simpler to state for Phi (7, —x): 

L7/2J m 

Pu(7,~x) = ^ x ] -{-~l-\){x Vl/2l+l )++ ^2 0<7<7max = 2m, 

J'=l J=L7/2J+1 

a?j := Xy, + X- • , J = 1, . . . , to, 
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with x\j] the j-th largest element in a vector x, and with the convention that a sum over an empty 
index set is zero. Note that in particular, since 7 max = 2m: 

m 

fhi(7max, ~X) = ]T)0^ + X^). 
i=l 

4.4.2 SAFE-SVM theorem 

Following the construction proposed in section I4~2l for the generic case, we select 7 = G\u(9), where 
the point 9 is feasible for (fl7|) . and can found by the scaling method outlined in section 1431 as 
follows. The method starts with the assumption that there is a value Ao > A for which we know the 
optimal value 70 of "Phi(Ao). 

Specific choices for Ao,7o- Let us first detail how we can find such values Ao, 70- 

We can set a value Ao such that A > Ao ensures that w = is optimal for the primal problem (|16p . 
The value that results in the least conservative test is Ao = A max , where A max is the smallest value 
of A above which w = is optimal: 

A max :=min \\X0W*, : -9 T 1 > 7max , 9 T y = 0, -1 < < 0. (20) 
9 

Since A max may be relatively expensive to compute, we can settle for an upper bound A max on A max . 
One choice for A max is based on the test derived in the previous section: we ask that it passes for 
all the features when A = A max and 7 = 7 max . That is, we set 



(21) 



Amax = max max(P hi (7 max ,Xfc),Phi(7max,-^/e)) 
1 < h < n 

(mm \ 
i=l i=l / 

By construction, we have A max > A max , in fact: 

A max = max max \xj6\ : ~9 T 1 > 7, nax , 9 T y = 0, — 1 < 9 < 

l<fc<n 9 

= max \\X6Woo : -9 T l> lmax , 9 T y = 0, -1 < 6 < 0, 

The two values A max , A max coincide if the feasible set is a singleton, that is, when m+ = m_. On 
the whole interval Ao € [A max , A max ], the optimal value of problem 'Phi(Ao) is 7 max . 

Dual scaling. The remainder of our analysis applies to any value A for which we know the optimal 
value 7 G [0,7 max ] of the problem 7\i(A )- 

Let 9 be a corresponding optimal dual point (as seen shortly, the value of 9 is irrelevant, as we 
will only need to know 70 = l T 9o). We now scale the point 9q to make it feasible for "Phi(A), where 
A (0 < A < Ao) is given. The scaled dual point is obtained as 9 = s9o, with s solution to (fT5)) . We 
obtain the optimal scaling s = A/Ao, and since 70 = —1 t 9q, the corresponding bound is 

7(A) = l T (s6» ) = s 7o = 7ov"- 

^0 

Our test takes the form 

A > max (P hi (7(A),, x), P hi (7(A), -x)) . 
Let us look at the condition A > Phi(7(A), —x): 

nt 

3n>Q,v : A > - 7 (A)/i + + vyi + Xj) + , 

i=i 



12 



Safe Feature Elimination 



which is equivalent to: 

m 



\ _ • »=i 
A > mm 



m>o,k 1 + (7o/Ao)jU 

The problem of minimizing the above objective function over variable v has a closed-form solution. 
In appendix IC.21 we show that for any vectors x^ € R m± , we have 

$(x+,aT) := min ^T(xf + v) + + ^{xT - v ) + = J2( x [i] 

i— 1 i= 1 2 — 1 

with x\j-\ the j-th largest element in a vector x. Thus, the test becomes 

in 

A > mm 



(22) 



m>o 1 + (7o/Ao)A* 
Setting k = Ao/(Ao + 7om); we obtain the following formulation for our test: 

where := xjji + fe, i = 1, . . . ,m, and for z G R m , we define 

m 

G(z) := min > (1 — K + KZj)+. 

0<K<1 ^— ' 
~ ~~ i=l 

We show in appendix IC.3I that G(z) admits a closed-form expression, which can be computed in 
O(dlogii), where d is the number of non-zero elements in vector z. By construction, the test removes 
all the features if we set Ao = A max , 70 = 7max, and when A > A max . 

Theorem (SAFE-SVM) Consider the SVM problem 7> hi (A) in {H|). Denote by x k the k-th 
row of the matrix [yiZi, ■ ■ ■ , y m Z m ], and let X± := {i : yi = ±1}, m± := \T±\, m := min(?7i + , m_), 
and 7 m ax ■= 2m. Let Xq > A be a value for which the optimal value 70 G [0,7 max ] of V sq (Xo) is 
known. The following condition allows to remove the k-th feature vector x k : 

A > ^£ max ( G{^-x k ), G(^-x k )) , (23) 
7o V 2A 2A / 

where (x k )i := (x k )~^ + (x k )^, {x k ) l := (-x k )^ + (-x k )^, i = l,...,m, and for z <E M m : 

1 P 

G(z) = mm — - ^(Zi - z)+ : z € {-oo, 0, (zj)j : Zj<0 } 

i=l 

A specific choice for Ao is A max given by V21}) , with corresponding optimal value 70 — 7max- B 

4.5 SAFE for Sparse Logistic Regression 

We now consider the sparse logistic regression problem: 

in 

V lo (X) : 0(A) := min V log (l + cxp(- yi (zf w + v))) + A|M|i« (24) 

W.V » 

i=l 

with the same notation as in section 14.41 The dual problem takes the form 

m 

Ao(A) : 0(A) := max £ (0, log(-0«) - (1 + O.f log(l + 0,)) : -1 < 9 < 0, 6 T y = 0, (25) 
i=i \9 T x k \ < A, k = l,...,n. 
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4.5.1 Test, 7 given 

Assume that we know a lower bound on the problem, 7 < 4>(X). Since < <fi(\) < mlog2, we may 
assume that 7 <E [0, mlog2] without loss of generality. We proceed to formulate problem (jT5|l . For 
given x <G R m , and 7 £ R, we have 

fl og (7, x) = min -7/z + /U V /log ( ^ + — ) , (26) 

which can be computed in O(m) by two-dimensional search, or by the dual interior-point method 
described in appendix. (As mentioned before, an alternative, resulting in a more conservative test, 
is to fix i/, for example v — 0.) Our test to eliminate the fc-th feature takes the form 

A > Ti og (7,x fc ) := max(P log (7,Xfc),Pi og (7, -x*)). 

If 7 is known, the complexity of running this test through all the features is 0{nm). (In fact, the 
terms in the objective function that correspond to zero elements of x are of two types, involving 
/log (if///). This means that the effective dimension of problem (|26[) is the cardinality d of vector 
x, which in many applications is much smaller than m.) 

4.5.2 Obtaining a dual feasible point 

We can construct dual feasible points based on scaling one obtained by choice of a primal point 
(classifier weight) Wq- This in turn leads to other possible choices for the bound 7. 
For wq £ R n given, we solve the one-dimensional, convex problem 

m 

v Q := argmin f\ og (yixjw + ytb). 

i=l 



This problem can be solved by bisection in 0{m) time lKim et al.l (|2007l ). At optimum, the derivative 
of the objective is zero, hence y T 9o = 0, where 



1 

1 + cxp(yixfw + y l v ) 
Now apply the scaling method seen before, and set 7 by solving problem (TT 

4.5.3 A SPECIFIC EXAMPLE OF A DUAL POINT 

A convenient, specific choice in the above construction is to set wq = 0. Then, the intercept vq can 
be explicitly computed, as vq = log(m_|_/m_), where m± = |{i : yi = ±1}| are the class cardinalities. 
The corresponding dual point #0 is 

m_ 

(»i = +l) 

™ + i = l,...,m. (27) 
[Vi = -1), 

m 



The corresponding value of Aq is (see Kim et ail ( 20071 )): 



A := ||X T o ||oc = max \9%x k \. 

l<k<n 

We now compute 7(A) by solving problem p3j) . which expresses as 

7(A) = max Gi og (s6 ) = max -m+f*(-s-^-)-m-f*(-s ). (28) 

|s|<A/A 1*1 <A/Ao m s m 

The above can be solved analytically: it can be shown that s = A/Aq is optimal. 
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4.5.4 Solving the bisection problem 

In this section, we are given c <E R™, 7 € (0,mlog2), and we consider the problem 

m 

F* := mm F(fi) := - 7M + M / log ( c (i)/ M ). (29) 

Problem (|2"9"|) corresponds to the problem (j2"o]) . with 1/ set to a fixed value, and c(i) = y^Xi, i = 
l,...,m. We assume that c(i) 7^ for every i, and that n := mlog2 — 7 > 0. Observe that 
F* < Fq := Um„_ ) .o+ ^X/- 4 ) = l T c+; where c + is the positive part of vector c. 

To solve this problem via bisection, we initialize the interval of confidence to be [0, fj, u ], with [i u 
set as follows. Using the inequality log(l + e~ x ) > log2 — (l/2)x + , which is valid for every x, we 
obtain that for every fi > 0: 

(<#)M .... 1 



i=l 



We can now identify a value /x u such that for every /i > /i u , we have -F(^) > i*b: it suffices to ensure 
ti^i - (l/2)l T c+ > F , that is, 

(l/2)l r c + +F 3 l T c+ 
[i > fi u := 



n 2 m log 2 — 7 

4.5.5 Algorithm summary 

An algorithm to check if a given feature can be removed from a sparse logistic regression problem 
works as follows. 

Given: A, k (1 < k < n), f log (x) = log(l + e~% /,* og (i?) = (-1?) log(-tf) + (0 + 1) log(0 + 1). 

1. Set Ao = max |0 o a;fc|, where 6 0(1) = —ni-/m (j/j = +1), 6q(i) = —m+/m [yi = — 1), 

l<k<n 

i = 1, . . . , m. 

2. Set 

, U ,* / A TO A TO+ 

7(A) := - m+ / Iog (--_) - m _ /log (___). 

3. Solve via bisection a pair of one-dimensional convex optimization problems 

in 

P e = mm -j(X)fj, + fj,S2 fi og (eyi(x k )i/ fJ.) (e = ±1), 

i=l 

each with initial interval [0,/z u ], with 

m 

_ 3 j=i 

M " ~~ 2 to log 2 - 7 ' 

4. If A > max(P_|_ , P_ ) , the fc-th feature can be safely removed. 
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5. Numerical results 

In this section we explore the benefits of SAFE by running numerical experiment^] with different 
LASSO solvers. We present two kinds of experiments to highlight the two main benefits of SAFE. 
One kind, in our opinion the most important, shows how memory limitations can be reduced, by 
allowing to treat larger data sets. The other focuses on measuring computational time reduction 
when using SAFE a priori to the LASSO solver. 

We have used a variety of available algorithms for solving the LASSO problem. We use acronyms 
to refer to the follo wing methods: IPM stands for the Interior-Point Method for LASSO described 
in Kim et al.1 (2007); G LMNET corresponds to the Generalized Linear Model algorithm described in 



Friedman et al 
in iBecker et al 



1 201C ): TFOCS corresponds to Templates for First-Order Conic Solvers described 



2010|); FISTA and Homotopy stand for the Fast I terative Shr i nkage -Thresholding 



Algorithm and homotopy algorithm, described and implemented in Yang et al. ( 2010l ). respectively. 



Some methods (like IPM, TFOCS) do not return exact zeros in the final solution of the LASSO 
problem and the issue arises in evaluating the its cardinality. In appendix [El we discuss some issue 
related to the thresholding of the LASSO solution. 



In our experiments, we use data sets derived from text classification sources in lFrank and Asuncion 



(2010). We use medical journal abstracts from PubMed represented in a bag-of-words format, where 



stop words have been eliminated and capitalization removed. The dimensions of the feature matrix 
X we use from PubMed is in = 1,000,000 abstracts and n = 127,025 features (words). There is 
a total of 82, 209, 586 non-zeros in the feature matrix, with an average of about 645 non-zeros per 
feature (word). We also use data-sets derived from the headlines of The New York Times, (NYT) 
spanning a period of about 20 years (from 1985 to 2007). The number of headlines in the entire 
NYT data-set is m = 3, 241, 260 and the number of features (words) is n = 159, 943. There is a total 
of 14, 083, 676 non-zeros in the feature matrix, with an average of about 90 non-zeros per feature. 

In some applications such as iGawalt et all (|2010l ). the goal is to learn a short list of words that 



are predictive of the appearance of a given query term (say, "lung" or "china") in the abstracts of 
medical journals or NYT news. The LASSO problem can be used to produce a summarization of 
the query term across the many abstracts or headlines considered. To be manageable by a human 
reader, the list of predictive terms should be very short (say at most 100 terms) with respect to the 
size of the dictionary n. To produce such a short list, we solve the LASSO problem (fT]) with different 
penalty parameters A, and choose the appropriate penalty A that would generate enough non-zeros 
in the LASSO solution (around 100 non-zeros in our case). 



5.1 SAFE for reducing memory limit problems 

We experiment with PubMed data-set which is too large to be loaded into memory, and thus not 
amenable to current LASSO solvers. As described before, we are interested in solving the LASSO 
problem for a regularization parameter that would result in about 100 non-zeros in the solution. 
We implement algorithm [T] with a memory limit M = 1,000 features, where we have observed that 
for the PubMed data loading more than 1,000 features causes memory problems in the machine 
and platform we are using. The memory limit is approximately two orders of magnitudes less than 
the original number of features n, i.e. M w O.Oln. Using algorithm [TJ we were able to solved the 
LASSO problem for A = 0.04A maa; using a sequence of 25 LASSO problem with each problem having 
a number of features less than M — 1,000. Figure [3] shows the simulation result for the PubMed 
data-set. 



1. In our experiments, we have used an Apple Mac Pro 64-bit workstation, with two 2.26 GHz Quad-Core Intel Xeon 
processors, 8 MB on-chip shared L3 cache per processor, with 6 GB SDRAM, operating at 1066 MHz. 
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0.6 0.8 1 



Figure 3: A LASSO problem solved for the PubMed data-set and A = 0.04A rnaa; using a sequence of 
25 smaller size problems. Each LASSO problem in the sequence has a number of features 
Lp that satisfies the memory limit M = 1, 000, i.e Lp < 1, 000. 
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Figure 4: (a) Computational time savings, (b) Lasso solution for the sequence of problem between 
0.03\ ma x and \ ma x- The green line shows the number of features we used to solve the 
LASSO problem after using algoirthm[3] 



5.2 SAFE for LASSO run-time reduction 

We have used a portion of the NYT data-set corresponding to all headlines in year 1985, the cor- 
responding feature matrix has dimensions n = 38, 377 features and m = 192, 182 headlines, with 
an average of 21 non-zero per feature. We solved the plain LASSO problem and the LASSO prob- 
lem with SAFE as outlined in algoirthm [3] for a sequence of A logarithmically distributed between 
. We have used four LASSO solvers, IPM, TFOCS, FISTA and Homotopy to 



solve the LASSO problem. Figure 4(a) shows the computational time saving when using SAFE. Fig- 
ure [4(b)] shows the number of features we used to solve the LASSO problem when using SAFE, and 
the number of non-zeros in the solution. We realize that when using algorithm [3] we solve problems 
with a number of features at most 10, 000 instead of n = 38, 377 features, this reduction has a direct 



impact on the solving time of the LASSO problem as demonstrated in figure 4(a) 



5.3 SAFE for LASSO with intercept problem 

We return to the LASSO with intercept problem discussed in section 12.51 We generate a feature 
matrix X £ ]R mx ™ w ith m = 500, n = 10 6 . The entries of X has a Af(0, 1) normal distributed and 
sparsity density d = 0.1. We also generate a vector of coefficients u> £ M. n with 50 non-zero entries. 
The response y is generated by setting y — Xus + O.Olry, where 77 is a vector in M™ 1 with 7V(0, 1) 
distribution. We use GLMNET implemented in R to solve the LASSO problem with intercept. The 
generated data, X and y can be loaded into R , yet memory problems occur when we try to solve the 
LASSO problem. We use algorithm [T] with memory limit M = 10,000 features and A = 0.33A max - 
Figure [5] shows the number of non-zeros in the solution of the 352 sequence of problems used to 
obtain the solution at A = 0.33A max . 
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Figure 5: A LASSO problem with intercept solved for randomly generated data-set and A = 
0.33A max using a sequence of 352 smaller size problems. Each LASSO problem in the 
sequence has a number of features Lp that satisfies the memory limit M = 10,000, i.e 
Lp < 1000. 
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Appendix A. Expression of P("f, £&) (LASSO) 

We can express problem ([6]) in dual form as a convex optimization problem with two scalar variables. 
Hi and /z 2 : 



P(7,x k ) = min max x%6 + m (G(9) - 7) + /j, 2 g (6 - 6ft) 

= min — /ii7 — /i2.9 T ^o + max x k & + ViG(Q) + fJ>2g T 



mm -/ii7 - yu 2 .g f + max 6/ - - 



We obtain: 



P(7,Xfc) = min L(ni,fx 2 ) (30) 



with 



M2 ) = -sfo, + ^D 2 + J- ||x fc ||^ + \\g\\l + ^.9 - H2 \\g\\l , (31) 



and £> := - 2 7 J . 

To solve (f30|) . we take the derivative of (|3Tj) w.r.t /x 2 and set it to zero: 

H.9II2 +xlg- Mi II.9II2 =°- 

This implies that ^ 2 = max(0,/ii — jj^p)- When ^ < j^z, we have /i 2 = 0, fi\ = ^ x ^ 2 and 
P(l,Xk) takes the value: 

P( 7 ,x fe ) = -y T T fc + ||a- fc || 2 I?. 



On the other hand, when fix > jj^jp, we take the derivative of (|31| w.r.t /ii and set it to zero: 



D 2 nl = H, 



/ / t \2 \ 1/2 !/ 2 

with * fe = I ||a:fe||2 - j and l) = ^Z? 2 - \\gf 2 j . Substituting m and fi 2 in O, ^(7,2*) 



takes the value: 



p{ 1 ,x k ) = ei T Xk + ^ k b. 
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Appendix B. Expression of P(j, x), general case 

We show that the quantity P(j,x) defined in (fl"4"|) can be expressed in dual form (fl"5l) . This is a 
simple consequence of duality: 

P(7, x) = max 9 T x : G{6) > 7, 6 T b = 



= max min 9 T x + fi(G(9) - 7) - vO T b 

6 ^>0, v 



min max T x + ii(-y T 9 - ^ f*(0(i)) - 7) - v9 T b 

m 

min n + max 9 T (x — fiy — vz) — fi) f*(9(i)) 

i>0. v 6 — ' 

1=1 

/ l m \ 

min —7^ + yti max —9 T (x — \iy — vz) — 7 f*(9(i)) 

. \^ f ( Xi - vy{i) - vh\ 

mm -7/x + /i > / . 

t >0, * V A* / 



u>0, 1/ 

i=i 

Appendix C. SAFE test for SVM 

In this section, we examine various optimization problems involving polyhedral functions in one 
or two variables, which arise in section [4.4.11 for the computation of Phi (7, x) as well as in the 
SAFE-SVM theorem of section |41~2"1 

C.l Computing Phi (7, x) 

We first focus on the specific problem of computing the quantity defined in (|19[) . To simplify notation, 
we will consider the problem of computing Phi (7, —x), that is: 

m 

Phi(l, -x) = min -'Y/j,+ )(fi + vy i + x i ) + , (32) 

i=l 

where y € {— 1, 1}™, x £ M. m and 7 are given, with < 7 < 70 := 2 min(m + , m_). Here, X± := {i : 
yi = ±1}, and x + = (xi)i^x + , x~ = (xi)i<=i_, m± = |X±|, and m = min(m+,m_). Without loss of 
generality, we assume that both x + , x~ are both sorted in descending order: xf > . . . > x^j . 
Using a = n + v, /3 = /i — v, we have 



Phi (7, -x) = mm -l( a + f3)+ + q-)+ + ^(x, + /3) H 



i=l i=l 

7, 



mm max -±{a + 0) + ^(xf + a)+ + ^(^- + 0)+ - t(a + /?) 



(33) 

m_ 



= max mm -(J + t)(a + /?) + V(4 + a)+ + + /3)+ 

_ i=l i=l 

= max F(- +t,x + ) + F(- +t,x~), 
t>o 2 2 

where, for /igM and a; € R p , x\ > . . . > x p , we set 

p 

F(h,x) := min -/iz + y^(z + x l ) + , (34) 
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Expression of the function F. If ft > p, then with z — > +00 we obtain F(h, x) = —00. Similarly, 
if ft < 0, then z — > —00 yields F(h, x) = —00. When < h < p, we proceed by expressing F in dual 
form: 

F(h,x) = max u T x : < u < 1, u T l = ft. 

If /7 = p. then the only feasible point is u = 1, so that F(p,x) = l T x. If < ft < 1, choosing 
Ui = ft, U2 = • • • = v-p = 0, we obtain the lower bound F(h,x) > ftxi, which is attained with 
2 = — xi. 

Assume now that 1 < ft < p. Let h = q + r, with <7 = [ftj the integer part of ft, and < r < 1. 
Choosing u\ = . . . = Uq = 1, 7i 9 +i = r, we obtain the lower bound 

1 

F(h, x) > ^2 Xj + rx q+ i, 
3=1 



which is attained by choosing z = — x q +i in the expression (|34[) . 
To summarize: 

( hxi if < ft < 1, 

Y^ x j + ( h ~ lAD^lAI+i ifl<ft<P, 
F(h,x) = I 3=1 



(35) 



^2 X J if ft = p, 

j'=i 

, —00 otherwise. 

A more compact expression, valid for < ft < p if we set x p+ i = a; p and assume that a sum over an 
empty index sets is zero, is 

Lfej 

F{h,x) = Y^Xj + {h- [h\)x w+u 0<h<p. 
3=1 

Note that F(-,x) is the piece- wise linear function that interpolates the sum of the ft largest elements 
of x at the integer break points ft = 0, . . .,p. 

Expression of Phi (7, —a;). We start with the expression found in (|33l) : 

Phi(7, -x) = max F(| +t,x+) + +t,x~). 

Since the domain of F(-,x + ) + F(-,x~) is [0,m], and with < 7/2 < 70/2 = m, we get 
Phi(7,— x) = max G(ft, x + , x _ ) := F(h, x + ) + F(h, x~). 

j/2<h<m 

Since F(-,x) with x € R p is a piece- wise linear function with break points at 0, . . . ,p, a maximizer 
of G(-,x + ,x~) over [7/2,771] lies in {7/2, L7/2J + 1, . . . ,m}. Thus, 

Phi(l,-x) = max ( G(^,x + ,x~), max G(ft, .t+, a;") ) . 

Let us examine the second term, and introduce the notation Xj := x^ + xj , j = 1, . . . , m: 

h 

max G(h,x + ,x~) = max > (xf + xj) 

he{ll/2\+l,..., m } he{[l/2i+l,...,rn} ^ 3 3 

L7/2J + 1 m 

H H (%)+' 

i=l J=L7/2J+2 
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with the convention that sums over empty index sets are zero. Since 

L7/2J 

G(-^,x + ,x~) = ^2 Xj + (-- L2JK7/2J+1; 
3=1 



we obtain 



L7/2J 



Phih,-x) = Y Xj+max I (- - L^J) a; L7/2j+i^L7/2J+i + Y i x j)+ 

J' =1 \ j=L7/2J+2 

An equivalent expression is: 

L7/2J m 

E7 1 x - 

x j - (2 ~ L2 J )( — a; L-T/2J-!-i)+ + 2^ °^7<2m, 

J =1 j=[ 7 /2j+l 

% := x~j~ +xj , j = 1, . . . , to 

The function Phi(-,— x) linearly interpolates the values obtained for 7 = 2q with <j integer in 
{0,...,to}: 

q m 

p hi (2 g ,-x) = y^j + Y 

3=1 .3=9+1 

C.2 Computing x ) 

Let us consider the problem of computing 

m_i. m_ 

$(a; + , aT) := min + + ^(x" - 

?:=i ?:=i 

with £ R m± , xf>...> x^ , given. We can express $(.t + , .t~) in terms of the function F 
defined in (I531) : 

,+ 



$(x + ,a; ) = min (a; + + z^ + ) + + (a 

-hv + + Y (4 + + hv ~ + S K r - 0+ 







iei+ 


= max 


min 




1/+ 


= max 


min 


h 




= max 


min 







max min — hv + + z/)_|_ I + I r 



-/iiy + Y ( x i + v )+ ( v + 



max F(h, x + ) + F(h, x 

h 

max F(h,x + ) + F(h,x~) 

0<h<m 



= max(A, B, C), 
where F is defined in (l34l). and 



A= max F(h,x+) +F(h,x~), B := max F(/i, .t+) + a; - )), C = F(m, x + ) + F(m, aT). 

0</*<l l<h<m 



We have 



A := max F(h, x + ) + F(h, x )~ max + x-, ) = (xf + x-, )+. 

oa<i oa<i 
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Next: 



B = max F(h,x + ) + F{h,x~) 

l<h<m 



Observe that 



max y^( x t + x i ) + r { x t+i + 

ge{l,...,m-l},re[0,l[ ^ 1 1 q+L 

q 

fn maX n , + X i ) + + - T <Z+l) + 

in 

(x+ + x-) + Y,(4 + x 7)+- 

m 

B>C = 52(xt+xT). 



Moreover, if (x^ + x x ) > 0, then B = Y^iLi( x t + )+ — ^- On the other hand, if xf + x 1 < 0, 
then xf + x~l < for 2 < j < m, and A = JV^i ( x t + x 7)+ > x t + x i = B. In all cases, 

in 

<P(x+,x-) =max(A,B,C) =J2( x t + x i)+- 

i=l 

C.3 SAFE-SVM test 

Now we consider the problem that arises in the SAFE-SVM test 

p 



G(z) := min > (1 — n + KZi)+, 

0<K<1 ^ 



where z e f is given. (The SAFE-SVM condition $?Z§ involves z H = 7 /(2A )(a; [ l : ] + x^), i = 
1, . . . ,p := m.) We develop an algorithm to compute the quantity G(z), the complexity of which 
grows as O(dlogd), where d is (less than) the number of non-zero elements in z. 
Define Z± = {i : ± Zl > 0}, k := h := l=lo,l := \1 \. 

If fc = 0, Z+ is empty, and k = 1 achieves the lower bound of for G (z). If k > and /). = 0, 
that is, A: + / = p, then Z_ is empty, and an optimal k is attained in {0, 1}. In both cases (I + or X_ 
empty), we can write 



G(z) = min > (1 — k + nzi) + = min (p, S+) , 5+:= > 



l iei+ 

with the convention that a sum over an empty index set is zero. 

Next we proceed with the assumption that k ^ and h ^ 0. Let us re-order the elements of I_ 
in decreasing fashion, so that zi > = z^+i = . . . = Zk+i > Zk+i+i > ■ ■ ■ > z p , for every i G I + . 
(The case when Xq is empty is handled simply by setting / = in our formula.) We have 



G(z) = k + I + min < na + (1 — K + KZi) + 



i=k+l+l 



where, a := S + — k — I. The minimum in the above is attained at k = 0, 1 or one of the break points 
1/(1 — Zj) G (0, 1), where j G {k + I + 1, . . . At k = 0, 1, the objective function of the original 
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problem takes the values S+,p, respectively. The value of the same objective function at the break 
point K = 1/(1 — Zj), j = k + I + 1, . . . ,p 1 is k + I + Gj(z), where 



1 - Zj 



3 i=k+l+l v x J y + 

1 



— — ( a- (j-k-l- l)zj+ z i\ 
Zj \ i=k+i+i ) 

I —{ S+ -(j-l)z j -(k + l)(l-z j )+ Yl 

J V i=k+l- 
-(A; + *) + _-_ • 



This allows us to write 



G(z) = min [ p, z,, min [ Zi — (j — l)zj \ \ . 

{ k - 1 - \k J) 

The expression is valid when k + I = p (h = 0, Z_ is empty), I = (Xq is empty), or k = (1+ is 
empty) with the convention that the sum (resp. minimum) over an empty index set is (resp. +oo). 
We can summarize the result with the compact formula: 

1 P 

G(z)=min- V(z,-z) + : z G {-oo, 0, (zj)j : Zj<0 }. 

% ± — Z 

i=l 

Let us detail an algorithm for computing G(z). Assume h > 0. The quantity 

G(z):= min (G 3 {z)) 
k+l+\<3<p 

can be evaluated in less than 0(h), via the following recursion: 



1 —Gj{z) — J-P^ ; , ; , i lir\ 

l-Zj+i l-^'+i , 3 = k + l + 1, (36) 



= , 3 -Gj{z)-3 

G J+1 (z) = min(G J (z),G J+ i(z)) 

with initial values 



Gfc+i+i (z) = G fc+(+1 (z) = V Zi - (k + l)z k+i+1 

t - Zk+l+l 1 



K i=l 



On exit, G(z) = G p . 

Our algorithm is as follows. 

Algorithm for the evaluation of G(z). 

1. Find the index sets I + , Z_, Iq, and their respective cardinalities k,h,l. 

2. If fc = 0, set G(z) = and exit. 
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3. Set S+ = J2i=i z i- 

4. If h = 0, set G(z) = min(p, S+), and exit. 

5. If h > 0, order the negative elements of z. and evaluate G(z) by the recursion (|3€j[) . Set 
G(z) = min(p, <S+, G(z)) and exit. 

The complexity of evaluating G(z) thus grows in 0(k + hlogh), which is less than O(dlogd), where 
d = k + h is the number of non-zero elements in z. 



Appendix D. Computing Pi og (7, %) via an interior-point method 

We consider t he problem (|26p which arises w ith the logistic loss. We can use a generic interior 



point method Boyd and Vandenberghel ( 2004 ). and exploit the decomposable structure of the dual 



function G\ og . The algorithm is based on solving, via a variant of Newton's method, a sequence of 
linearly constrained problems of the form 



min tx t 8 + \og(Gi og (8) -7) +^log(-0- 6> 2 ) : z T 9 = 



where r > is a parameter that is increased as the algorithm progresses, and the last terms 
correspond to domain constraints 8 £ [— 1, 0] m . As an initial point, we can take the point 8 generated 
by scaling, as explained in section l4~2l Each iteration of the algorithm involves solving a linear system 
in variable S, of the form H5 = h, with if is a rank-two modification to the Hessian of the objective 
function in the problem above. It is easily verified that the matrix H has a "diagonal plus rank-two" 
structure, that is, it can be written as H = D — gg T — vv T , where the m x m matrix D is diagonal 
and <?, v € K m are computed in 0(m). The matrix H can be formed, as the associated linear system 
solved, in O(m) time. Since the number of iterations for this problem with two constraints grows as 
log(l/e)0(l), the total complexity of the algorithm is log(l/e)0(m) (e is the absolute accuracy at 
which the interior-point method computes the objective). We note that memory requirements for 
this method also grow as 0{m). 



Appendix E. On thresholding methods for LASSO 

Sparse classification algorithms may return a classifier vector w with many small, but not exactly 
zero, elements. This implies that we need to choose a thresholding rule to decide which elements to 
set to zero. In this section, we discuss an issue rela ted to the thresho lding rule originally proposed for 



the interior point method for Logistic algorithm in lKoh et al] (|2007l ), and propose a new thresholding 
rule. 

The KKT thresholding rule. Recall that the primal problem for LASSO is 

0(A) = min \\\X T w -y\\l + A|M|i- (37) 

Observing that the KKT conditions imply that, at opt imum, (X(X T w — y))k = Asign(wfc), with the 
convention sign(0) £ [—1,1], and following the ideas of Koh et al. ( 20071 ). the following thresholding 



rule can be proposed: at optimum, set component Wk to whenever 

\{X{X T w - y)) k \ < 0.9999A. (38) 

We refer to this rule as the "KKT" rule. 

The IPM-LASSO algorithm takes as input a "duality gap" parameter e, which controls the relative 
accuracy on the objective. When comparing the IPM code results with other algorithms such 
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as GLMNET, we observed chaotic behaviors when applying the KKT rule, especially when the 
duality gap parameter e was not small enough. More surprisingly, when this parameter is not 
small enough, some components Wk with absolute values not close to can be thresholded. This 
suggests that the KKT rule should only be used for problems solved with a small enough duality 
gap e. However, setting the duality gap to a small value can dramatically slow down computations. 
In our experiments, changing the duality gap from e = 10~ 4 to 10 -6 (resp. 10 -8 ) increased the 
computational time by 30% to 40% (resp. 50 to 100%). 

An alternative method. We propose an alternative thresholding rule, which is based on con- 
trolling the perturbation of the objective function that is induced by thresholding. 

Assume that we have solved the LASSO problem above, with a given duality gap parameter e. 
If we denote by w* the classifier vector delivered by the IPM algorithm, w* is e-sub-optimal, that 
is, achieves a value 

r = l\\Xw*-y\\ 2 2 + X\\w*\\ 1 , 

with < 0* - 0(A) < 60(A). 

For a given threshold r > 0, consider the thresholded vector w(r) defined as 



o ifKI<r, 

wl otherwise, 



We have w(t) = w* + <5(r) where the vector of perturbation <5(r) is such that 

6k[T) ~ { otherwise, 

Note that, by construction, we have ||w*||i = \\w* + S\\i + \\S\\i. Also note that if w* is sparse, so is 
S. 

Let us now denote by </> r the LASSO objective that we obtain upon replacing the optimum 
classifier w* with its thresholded version w(t) = w* + S(t): 

0r := ^\\X(w*+6(r))-yg + X\\w*+6(T)\\ 1 , 
Since w(t) is (trivially) feasible for the primal problem, we have <fi T > 0(A). On the other hand, 
0r = \\\Xw*-y\\l + X\\w*+5(r)\\ 1 + ^\\X6(r)\\l + S(T) T X T (Xw*-y) 
< \\\Xw* -y\\l + \\\w*\U + \\\X5{t)\\1 + S(r) T X T (Xw* - y). 
For a given a > 1, the condition 

C(r) := \\\X6(t)\\ 2 + S(r) T X T (Xw* - y) < ^\ k := - 1 > 0, (39) 

allows to write 

0(A) < T < (l + ae)0(A). 

The condition (|39p then implies that the thresholded classifier is sub-optimal, with relative accuracy 
ae. 

Our proposed thresholding rule is based on the condition (|39|) . Precisely, we choose the parameter 
a > 0, then we set the threshold level r by solving, via line search, the largest threshold r allowed 
by condition (f39|) : 



r a = argmax r : ||X5(r)|| 2 < [J^^ ~A \\Xw* y\\ 2 . 
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The larger a is, the more elements the rule allows to set to zero; at the same time, the more 
degradation in the objective will be observed: precisely, the new relative accuracy is bounded by ae. 
The rule also depends on the duality gap parameter e. We refer to the thresholding rule as TR(a) 
in the sequel. In practice, we observe that the value a = 2 works well, in a sense made more precise 
below. 

The complexity of the rule is O(mn). More precisely, the optimal dual variable 8* = Xw* — y 
is returned by IPM-LASSO. The matrix X8* = X(X T w* — y) is computed once for all in 0(mn). 
We then sort the optimal vector w* so that \w*^\ < ... < and set r = tq = |to? n J, so that 

<5fc(To) = — and Wk(ro) = for all k — 1, . . . , n. The product X5(to) is computed in 0{mn), while 
the product 5(t ) t (X t 9*) is computed in 0{n). If the quantity C(r ) = ±\\X5(T )\\ 2 + 5(T a ) T {x T 6*) 
is greater than n<f)*, then we set r = t\ = We have Sk(ri) = <5fc(T ) for any k ^ (n) and 

S( n )( T i) = 0. Therefore, C(ti) can be deduced from C(tq) in 0(n). We proceed by successively 
setting Tfc = |u^* n _ fc )| until we reach a threshold T& such that C(i~k) < «</>*. 

Simulation study. We conducted a simple simulation study to evaluate our proposal and compare 
it to the KKT thresholding rule. Both methods were further compared to the results returned by 
the glmnet R package. The latter algorithm returns hard zeros in the classifier coefficients, and we 
have chosen the corresponding sparsity pattern as the "ground truth", which the IPM should recover. 

We first experimented with synthetic data. We generated samples of the pair (X, y) for various 
values of (m,n). We present the results for (m,n) = (5000,2500) and (m,n) = (100,500). The 
number s of relevant features was set to min(?7i, n/2). Features were drawn from independent 
7V(0, 1) distributions and y was computed as y = X T w + £, where £ ~ A/"(0, 0.2) and w is a vector 
of R™ with first s components equal to 0.1 + 1/s and remaining n — s components set to 0. Because 
glmnet includes an unpenalized intercept while IPM method does not, both y and X were centered 
before applying either methods to make their results comparable. 

Results are presented on Figures |51 First, the KKT thresholding rule was observed to be very 
chaotic when the duality gap was set to e = 10~ 4 (we recall here that the default value for the duality 
gap in IPM MATLAB implementation is e = 10~ 3 ), while it was way better when duality gap was set 
to e = 10~ 8 (somehow justifying our choice of considering the sparsity pattern returned by glmnet 
as the ground truth). Therefore, for applications where computational time is not critical, running 
IPM method and applying KKT thresholding rule should yield appropriate results. However, when 
computational time matters, passing the duality gap from, say, 10 -4 to 10 -8 , is not a viable option. 
Next, regarding our proposal, we observed that it was significantly better than KKT thresholding 
rule when the duality gap was set to 10 -4 and equivalent to KKT thresholding rule for a duality gap 
of 10~ 8 . Interestingly, setting a = 1.5 in ((39| generally enabled to achieved very good results for 
low values of A, but lead to irregular results for higher values of A (in the case m = 100, results were 
unstable for the whole range of A values we considered). Overall, the choices a = 2, 3 and 4 lead to 
acceptable results. A little irregularity remained with a = 2 for high values of A, but this choice of 
a performed the best for lower values of A. As for choices a = 3 and a = 4, it is noteworthy that 
the results were all the better as the dimension n was low. 

E.l Real data examples 

We also applied our proposal and compared it to KKT rule (|38|) on real data sets arising in text 
classification. More precisely, we used the New York Times headlines data set presented in the 
Numerical results Section. For illustration, we present here results we obtained for the topic "China" 
and the year 1985. We successively ran IPM-LASSO method with duality gap set to 10 -4 and 10~ 8 
and compare the number of active features returned after applying KKT thresholding rule (|38p and 
TR (1.5), TR (2), TR (3) and TR (4). Results are presented on Figure [3 Because we could not 
applied glmnet on this data set, the ground truth was considered as the result of KKT rule, when 
applied to the model returned by IPM-LASSO ran with duality gap set to 10~ 10 . Applying KKT 
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Figure 6: Comparison of several thresholding rules on synthetic data: the case m = 5000, n = 100 
(top panel) and m = 100, n = 500 (bottom panel) with duality gap in IPM method set 
to (i) 10~ 4 (left panel) and (Hi) 10 -8 (right panel). The curves represent the differences 
between the number of active features returned after each thresholding method and the 
one returned by glmnet (this difference is further divided by the total number of features 
n). The graphs pre sent the results attached to six thresholding rules: the one proposed 
by iKoh et al.l ( 2007 ) and five versions of our proposal, corresponding to setting a in (|39|) 
to 1.5, 2, 3, 4 and 5 respectively. Overall, these results suggest that by setting a <G (2, 5), 
our rule is less sensit ive to the value o f the duality gap parameter in IPM-LASSO than is 
the rule proposed bv lKoh et all (|2007l) . 
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Figure 7: Comparison of several thresholding rules on the NYT headlines data set for the topic 
"China" and year 1985. Duality gap in IPM-LASSO was successively set to 1CP 4 (left 
panel) and 10~ 8 (right panel). The curves represent the differences between the number of 
active features returned after each thresholding method and the one returned by the KKT 
rule when duality gap was set to 10~ 10 . The graphs present the results attached to five 
thresholding rules: the KKT rule and four versions of our rule, corresponding to setting 
a in (f39|) to 1.5, 2, 3 and 4 respectively. Results obtained following our proposal appear 
to be less sensitive to the value of the duality gap used in IPM-LASSO. For instance, for 
the value A = A max /1000, the KKT rule returns 1758 active feature when the duality gap 
is set to 10 -4 while it returns 2357 features for a duality gap of 10 -8 . 



rule on the model built with a duality gap of 10~ 4 lead to very misleading results again, especially for 
low values of A. In this very high-dimensional setting (n = 38377 here), our rule generally resulted 
in a slight "underestimation" of the true number of active features for the lowest values of A when 
the duality gap was set to 10 -4 . This suggests that the "optimal" a for our rule might depend on 
both n and A when the duality gap is not small enough. However, we still observed that our proposal 
significantly improved upon KKT rule when the duality gap was set to 10~ 4 . 



30 



Safe Feature Elimination 



References 

S.R. Becker, E.J. Candes, and M. Grant. Templates for convex cone problems with applications to 
sparse signal recovery. Stanford University Technical Report, 2010. 

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New 
York, NY, USA, 2004. ISBN 0521833787. 

S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM review, 
43:129, 2001. 

David L. Donoho and Yaakov Tsaig. Fast solution of Z_l-norm minimization problems when the 
solution may be sparse. IEEE Trans. Inform. Theory, 54(11):4789-4812, 2008. 

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression (with 
discussion). Ann. Statist., 32:407-499, 2004. 

Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. 
J. Roy. Statist. Soc. Ser. B, 70(5):849-911, 2008. 

Jianqing Fan and Jinchi Lv. A selective overview of variable selection in high dimensional feature 
space. Statist. Sinica, 20:101-148, 2010. 

George Forman. An extensive empirical study of feature selection metrics for text classification. J. 
Much. Learn. Res., 3:1289-1305, 2003. 

A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL 

http : / / archive . ics . uci . edu/ml 

Jerome Friedman, Trevor Hastie, Holger Hofling, and Robert Tibshirani. Pathwise coordinate opti- 
mization. Ann. Appl. Statist., l(2):302-332, 2007. 

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized 
linear models via coordinate descent. Journal of Statistical Software, 33(l):l-22, 2010. URL 
http : //www. jstatsof t . org/v33/ iOl/ 

Brian Gawalt, Jinzhu Jia, Luke Miratrix, Laurent El Ghaoui, Bin Yu, and Sophie Clavier. Discov- 
ering word associations in news media via feature selection and sparse classification. In MIR '10: 
Proceedings of the international conference on Multimedia information retrieval, pages 211-220, 
2010. 

Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky. An 
interior-point method for large-scale ^_l-regularized least squares. IEEE J. Select. Top. Sign. 
Process., 1(4):606-617, 2007. 

Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for large-scale 
^_l-regularized logistic regression. JMLR, 8:1519-1555, 2007. 

Mee Young Park and Trevor Hastie. L_l-regularization path algorithm for generalized linear models. 
J. R. Stat. Soc. Ser. B Stat. MethodoL, 69(4):659-677, 2007. 

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical 
Society. Series B (Methodological), 58(l):267-288, 1996. ISSN 0035-9246. 

A. Yang, A. Ganesh, Z. Zhou, S. Sastry, and Y. Ma. Fast 11-minimization algorithms and an 
application in robust face recognition: a review. University of California at Berkeley Technical 
report UCB/EECS-2010-13, 2010. 



31 



